The word entropy of natural languages
Abstract
The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics. Entropy is an established measure of this average uncertainty, also called average information content. Here we use parallel texts of 21 languages to establish the number of tokens at which word entropies converge to stable values. These convergence points are then used to select texts from a massively parallel corpus and to estimate word entropies across more than 1000 languages. Our results help to establish quantitative language comparisons, to understand the performance of multilingual translation systems, and to normalize semantic similarity measures.
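As a rough illustration of the measure, the sketch below estimates the plug-in (maximum-likelihood) word entropy of a tokenized text and checks when the estimate stabilizes as tokens accumulate. The whitespace tokenization, step size, tolerance, and file path are assumptions for illustration, not the estimator or convergence criterion used in the paper.

```python
# A minimal sketch of plug-in (maximum-likelihood) word-entropy estimation
# with a naive convergence check. Step size, tolerance, and tokenization
# are illustrative assumptions, not the paper's actual procedure.
import math
from collections import Counter

def word_entropy(tokens):
    """Shannon entropy (bits/word) of the unigram distribution over tokens."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def convergence_point(tokens, step=10_000, tol=0.01):
    """Smallest token count at which successive entropy estimates
    differ by less than `tol` bits."""
    prev = None
    for n in range(step, len(tokens) + 1, step):
        h = word_entropy(tokens[:n])
        if prev is not None and abs(h - prev) < tol:
            return n, h
        prev = h
    return len(tokens), word_entropy(tokens)

# Illustrative usage on a whitespace-tokenized text (hypothetical path):
# tokens = open("parallel_text.txt", encoding="utf-8").read().split()
# n_conv, h = convergence_point(tokens)
# print(f"entropy stabilized at {h:.2f} bits/word after {n_conv} tokens")
```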
Similar resources
The Entropy of Words - Learnability and Expressivity across More than 1000 Languages
The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics, and the language sciences more generally. Information theory gives us the tools to measure precisely the average amount of choice associated with words – the word entropy. Here we use three parallel corpora – encompassing ca. 450 million w...
Complexity measurement of natural and artificial languages
We compared the entropy of texts written in natural languages (English, Spanish) and artificial languages (computer software), based on a simple expression for the entropy as a function of message length and specific word diversity. Code written in artificial languages showed higher entropy than text of similar length expressed in natural languages. Spanish texts exhibit more symbolic diversit...
Word-Forming Process in Azeri Turkish Language
This study examines the general methods of natural word formation in the Azeri Turkish language by analyzing the construction of compound Azeri Turkish words. Same’ei (2016) conducted a comprehensive study of the word-forming process in Farsi, which was the inspiration for this study of Azeri Turkish word formation. Numerous scholars had done vari...
Discovery of Kolmogorov Scaling in the Natural Language
We consider the rate R and variance σ² of Shannon information in snippets of text, based on word frequencies in the natural language. We empirically identify Kolmogorov's scaling law in σ² ∝ k^(−1.66 ± 0.12) (95% c.l.) as a function of k = 1/N, measured by word count N. This result highlights a potential association of information flow in snippets, analogous to energy cascade in turbulent ed...
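The scaling analysis in the preceding entry can be sketched roughly as follows: compute the Shannon information of fixed-length snippets from global unigram frequencies, then fit a power law to the variance across snippet lengths. Treating σ² as the variance of total snippet information, along with the snippet lengths and file path, are assumptions for illustration; the paper's exact definitions of R and σ² may differ.

```python
# A hedged sketch of the snippet-variance scaling fit: sigma^2 ∝ k^alpha,
# with k = 1/N and N the snippet length in words. The operationalization
# of "Shannon information in snippets" here is an assumption.
import math
from collections import Counter
import numpy as np

def snippet_information(tokens, n):
    """Total Shannon information (bits) of each length-n snippet, using
    unigram probabilities estimated from the whole token stream."""
    total = len(tokens)
    p = {w: c / total for w, c in Counter(tokens).items()}
    infos = []
    for i in range(0, total - n + 1, n):
        infos.append(sum(-math.log2(p[w]) for w in tokens[i:i + n]))
    return infos

def fit_scaling_exponent(tokens, snippet_lengths=(50, 100, 200, 400, 800)):
    """Slope of log(variance of snippet information) vs log(k), k = 1/N."""
    ks, variances = [], []
    for n in snippet_lengths:
        ks.append(1.0 / n)
        variances.append(np.var(snippet_information(tokens, n)))
    slope, _ = np.polyfit(np.log(ks), np.log(variances), 1)
    return slope  # sigma^2 ∝ k^(-1.66) corresponds to a slope near -1.66

# tokens = open("corpus.txt", encoding="utf-8").read().split()  # hypothetical path
# print(f"fitted scaling exponent: {fit_scaling_exponent(tokens):.2f}")
```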
Universal Entropy of Word Ordering Across Linguistic Families
BACKGROUND: The language faculty is probably the most distinctive feature of our species, and endows us with a unique ability to exchange highly structured information. In written language, information is encoded by the concatenation of basic symbols under grammatical and semantic constraints. As is also the case in other natural information carriers, the resulting symbolic sequences show a deli...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
Journal: CoRR
Volume: abs/1606.06996
Pages: -
Publication date: 2016